Using Python from within R: An R package development perspective

CCHMC Bilingual Data Science Meeting

Cole Brokamp

7/13/23

BUG & RUG present

   

Bilingual Data Science Meeting

   

July 13th, 2023

   

👋 Welcome

Join the RUG Outlook group for updates and events.

Why python?

  • ‘default’ bindings for well-established data libraries are often available first in Python
  • “convert this python code to R”
  • working with large datasets on disk
  • machine learning, natural language processing

Bilingual data science

 

Why?

  • streamline data science team workflows
  • concepts (e.g., tidy, ggplot, pandas)
  • value of code/language includes technical and cognitive speed

 

Examples

  • RMarkdown and RStudio
  • Posit
  • Shiny for Python
  • Quarto
  • Connect

Learning python for R users

R-Python data type conversions

| R | Python | Examples |
|---|---|---|
| Single-element vector | Scalar | 1, 1L, TRUE, "foo" |
| Multi-element vector | List | c(1.0, 2.0, 3.0), c(1L, 2L, 3L) |
| List of multiple types | Tuple | list(1L, TRUE, "foo") |
| Named list | Dict | list(a = 1L, b = 2.0), dict(x = x_data) |
| Matrix/Array | NumPy ndarray | matrix(c(1,2,3,4), nrow = 2, ncol = 2) |
| Data Frame | Pandas DataFrame | data.frame(x = c(1,2,3), y = c("a", "b", "c")) |
| Function | Python function | function(x) x + 1 |
| NULL, TRUE, FALSE | None, True, False | NULL, TRUE, FALSE |
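The Python column of the table can be checked in plain Python. The sketch below pairs some of the table's R expressions with the Python value the table says they map to, then prints each value's type. The NumPy and pandas rows are left out so it runs with built-ins only, and the variable names are illustrative.

```python
# Each pair is (R expression from the table, Python value it maps to per the table).
examples = [
    ("1", 1.0),                                     # R double -> float
    ("1L", 1),                                      # R integer -> int
    ("TRUE", True),                                 # R logical -> bool
    ('"foo"', "foo"),                               # R character -> str
    ("c(1.0, 2.0, 3.0)", [1.0, 2.0, 3.0]),          # multi-element vector -> list
    ('list(1L, TRUE, "foo")', (1, True, "foo")),    # list of multiple types -> tuple
    ("list(a = 1L, b = 2.0)", {"a": 1, "b": 2.0}),  # named list -> dict
    ("NULL", None),                                 # NULL -> None
]

# Record the Python type name for each R expression.
python_types = {r_expr: type(value).__name__ for r_expr, value in examples}

for r_expr, type_name in python_types.items():
    print(f"{r_expr:24} -> {type_name}")
```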

reticulate

 

reticulate is the R interface to python

 

https://rstudio.github.io/reticulate/

 

# install.packages("reticulate")
library(reticulate)
os <- import("os")
os$listdir(".")
#> [1] "index.html"      ".DS_Store"       "figs"            "index.qmd"
#> [5] "index.rmarkdown" "codec.scss"

Configuring python

By default, {reticulate} uses the first available non-system Python executable:

Sys.which("python3")
#>                                               python3
#> "/Users/broeg1/.virtualenvs/r-reticulate/bin/python3"

Alternatively, create or select a Python version in a virtualenv or Conda environment:

virtualenv_create("r-parcel")
#> virtualenv: r-parcel
use_virtualenv("~/.virtualenvs/r-parcel/")
py_config()
#> python:         /Users/broeg1/.virtualenvs/r-reticulate/bin/python
#> libpython:      /Users/broeg1/.pyenv/versions/3.9.12/lib/libpython3.9.dylib
#> pythonhome:     /Users/broeg1/.virtualenvs/r-reticulate:/Users/broeg1/.virtualenvs/r-reticulate
#> version:        3.9.12 (main, May 11 2023, 16:29:21)  [Clang 14.0.3 (clang-1403.0.22.14.1)]
#> numpy:          /Users/broeg1/.virtualenvs/r-reticulate/lib/python3.9/site-packages/numpy
#> numpy_version:  1.25.1
#> os:             /Users/broeg1/.pyenv/versions/3.9.12/lib/python3.9
#>
#> python versions found:
#>  /Users/broeg1/.virtualenvs/r-reticulate/bin/python
#>  /opt/homebrew/bin/python3
#>  /usr/bin/python3
#>  /Users/broeg1/.virtualenvs/r-parcel/bin/python
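From the Python side, the standard library can confirm which interpreter and environment are actually in use. This is a minimal sketch that overlaps with part of what py_config() reports. The helper name describe_interpreter is hypothetical and not part of {reticulate}.

```python
import sys


def describe_interpreter():
    """Summarise the running interpreter using only the standard library."""
    return {
        "executable": sys.executable,
        "version": "{}.{}.{}".format(*sys.version_info[:3]),
        "prefix": sys.prefix,
        # In a venv-style environment, sys.prefix differs from sys.base_prefix.
        "in_virtualenv": sys.prefix != sys.base_prefix,
    }


info = describe_interpreter()
for key, value in info.items():
    print(f"{key:14} {value}")
```

Running this inside the session that R has bound to (for example via repl_python()) is a quick way to check that the intended environment was picked up.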

Install

 

python

install_miniconda()
# or
install_python()

By default, packages installed with py_install() are stored in a virtualenv or conda environment named r-reticulate

 

python packages

Create an environment, install packages within it, and then call from R:

virtualenv_install("r-reticulate", "usaddress")
#> Using virtual environment 'r-reticulate' ...

(Packages can also be managed with the usual Python tools, such as pip.)

usaddress

🇺🇸 a python library for parsing unstructured United States address strings into address components

>>> import usaddress
>>> usaddress.tag('123 Main St. Suite 100 Chicago, IL')
(OrderedDict([
  ('AddressNumber', '123'),
  ('StreetName', 'Main'),
  ('StreetNamePostType', 'St.'),
  ('OccupancyType', 'Suite'),
  ('OccupancyIdentifier', '100'),
  ('PlaceName', 'Chicago'),
  ('StateName', 'IL')]),
'Street Address')

It uses a probabilistic parser trained on real, parsed addresses to return the tagged address parts along with the address type (here, 'Street Address').
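The result of tag() is a pair: the tagged components and the address type. The sketch below shows one way to unpack it. To run without usaddress installed, it uses a hard-coded copy of the output shown above rather than calling the library. The street_keys selection is an illustrative choice, not something usaddress prescribes.

```python
from collections import OrderedDict

# Hard-coded copy of the tag() result shown above (usaddress is not called here).
tagged = (
    OrderedDict([
        ("AddressNumber", "123"),
        ("StreetName", "Main"),
        ("StreetNamePostType", "St."),
        ("OccupancyType", "Suite"),
        ("OccupancyIdentifier", "100"),
        ("PlaceName", "Chicago"),
        ("StateName", "IL"),
    ]),
    "Street Address",
)

# Unpack the pair into the component mapping and the address type.
components, address_type = tagged

# Join selected components into a street line; tags that are absent are skipped.
street_keys = ("AddressNumber", "StreetName", "StreetNamePostType")
street_line = " ".join(components[k] for k in street_keys if k in components)

print(address_type)
print(street_line)
print(components.get("ZipCode"))  # None, because this address has no ZIP code tag
```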

Calling python

  1. Python REPL: repl_python()
  2. Python in R Markdown
  3. Sourcing Python scripts
  4. Importing Python modules

 

  • R data types are automatically converted to their equivalent Python types
  • The Python session persists, so its objects remain available across calls
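For option 3, a Python script only needs to define ordinary top-level functions. A minimal sketch of such a script, using the hypothetical file name add.py and a hypothetical function add:

```python
# add.py (hypothetical file name)


def add(x, y):
    """Return the sum of x and y."""
    return x + y


if __name__ == "__main__":
    # Quick check when the file is run directly with Python.
    print(add(5, 10))
```

From R, reticulate::source_python("add.py") would then make add() callable like any other R function, with arguments and return values converted as described above.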

Import

Call the usaddress module from R by importing it:

usaddress <- import("usaddress")

Call functions (and access other data) within Python modules and classes via the $ operator, which means code completion and inline help are built in:

usaddress$tag("3333 Burnet Ave Cincinnati OH 45219")
#> [[1]]
#> [[1]]$AddressNumber
#> [1] "3333"
#>
#> [[1]]$StreetName
#> [1] "Burnet"
#>
#> [[1]]$StreetNamePostType
#> [1] "Ave"
#>
#> [[1]]$PlaceName
#> [1] "Cincinnati"
#>
#> [[1]]$StateName
#> [1] "OH"
#>
#> [[1]]$ZipCode
#> [1] "45219"
#>
#>
#> [[2]]
#> [1] "Street Address"

{parcel} package development

 

https://github.com/geomarker-io/parcel

 

Followed the best-practice suggestions from the {reticulate} package authors.

Provide function to install dependencies

https://github.com/geomarker-io/parcel/tree/main#installation

Delay loading python modules

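# placeholders in the package namespace, filled in when the package loads
usaddress <- NULL
dedupe <- NULL
py <- NULL

.onLoad <- function(libname, pkgname) {
  # delay_load = TRUE defers importing each module until it is first used
  usaddress <<- reticulate::import("usaddress", delay_load = TRUE, convert = TRUE)
  dedupe <<- reticulate::import("dedupe", delay_load = TRUE, convert = FALSE)
  py <<- reticulate::import_builtins(convert = TRUE)
}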

Guard tests so they skip when Python modules are unavailable (CRAN or other automated checks)

skip_if_no_usaddress <- function() {
  have_usaddress <- reticulate::py_module_available("usaddress")
  if (!have_usaddress) {
    skip("usaddress python module not available for testing")
  }
}

test_that("tag_address works", {
  skip_if_no_usaddress()
  tag_address("3333 Burnet Ave Cincinnati OH 45219") |>
    expect_identical(
      tibble::tibble(
        street_number = "3333",
        street_name = "burnet ave",
        city = "cincinnati",
        state = "oh",
        zip_code = "45219"
      )
    )
})
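The skip helper above depends on asking whether a Python module can be imported. A comparable check in plain Python is sketched below with the standard library. It is not how py_module_available() is implemented, and module_available is a hypothetical name.

```python
import importlib.util


def module_available(name):
    """Return True if a top-level module called `name` can be located for import."""
    return importlib.util.find_spec(name) is not None


print(module_available("json"))       # standard-library module
print(module_available("usaddress"))  # depends on the active environment
```

A check like this only locates the module without importing it, so it is cheap to run before deciding whether to skip.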

Use your own convert methods as necessary

From parcel:

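# convert = FALSE keeps results as Python objects until converted explicitly
np <- reticulate::import("numpy", convert = FALSE)
alinks <- np$array(links)
pairs <-
  alinks[["pairs"]] |>
  reticulate::py_to_r() |> # convert only the needed piece back to R
  as.vector()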

Thank You

 

🌐 https://colebrokamp.com

👨‍💻️ github.com/cole-brokamp

🐦 @cole_brokamp

📧 cole.brokamp@cchmc.org